[Figure 3.26 (see caption below): loss-surface diagram with labels: Value, Backtrack, Global minima, Bilinear constraint, Sparse, Dense, DReLU, Recurrent module, Predict, Local minima.]
FIGURE 3.26
An illustration of the RBONN framework. Conventional gradient-based algorithms assume that the hidden variables in bilinear models are independent, which leads to insufficient training of $\mathbf{w}$ because its relationship with $\mathbf{A}$ is neglected, as shown in the loss surface (right part). Our RBONN helps $\mathbf{w}$ escape from local minima and reach a better solution.
can be considered a bilinear optimization problem with the objective function as
\begin{equation}
\arg\min_{\mathbf{w},\,\boldsymbol{\alpha}} G(\mathbf{w}, \boldsymbol{\alpha}) = \|\mathbf{w} - \boldsymbol{\alpha} \circ \mathbf{b}_{\mathbf{w}}\|_2^2 + R(\mathbf{w}),
\quad \text{or} \quad
\arg\min_{\mathbf{w},\,\mathbf{A}} G(\mathbf{w}, \mathbf{A}) = \|\mathbf{b}_{\mathbf{w}} - \mathbf{A}\mathbf{w}\|_2^2 + R(\mathbf{w}),
\tag{3.121}
\end{equation}
where $\mathbf{A} = \mathrm{diag}\!\left(\frac{1}{\alpha_1}, \cdots, \frac{1}{\alpha_N}\right)$ and $N$ is the number of elements in $\boldsymbol{\alpha}$. $\circ$ denotes channel-wise multiplication, and $R(\cdot)$ represents a regularization term, typically the $\ell_1$ or $\ell_2$ norm.
$G(\mathbf{w}, \mathbf{A})$ includes the bilinear form $\mathbf{A}\mathbf{w}$, which is widely used in the field of computer vision [52, 162, 97]. Note that the bilinear function is $\mathbf{A}\mathbf{w}$ rather than $G(\mathbf{w}, \mathbf{A})$ in Eq. 3.121. Eq. 3.121 is well suited to BNNs, with $\mathbf{A}$ and $\mathbf{w}$ as bilinearly coupled variables, since $\mathbf{w}$ is the variable to be optimized and $\mathbf{b}_{\mathbf{w}}$ is simply the sign of $\mathbf{w}$.
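As a concrete illustration of the first form in Eq. 3.121, the minimal PyTorch sketch below evaluates $G(\mathbf{w}, \boldsymbol{\alpha})$ for a single convolutional layer with per-output-channel scaling factors. The layer shape, the $\ell_2$ choice of $R(\cdot)$, and the mean-absolute-value initialization of $\boldsymbol{\alpha}$ are assumptions made for illustration only.

```python
import torch

def bilinear_objective(w, alpha, reg_weight=1e-4):
    """Evaluate G(w, alpha) = ||w - alpha o b_w||_2^2 + R(w) for one conv layer.

    w      -- real-valued weights, shape (N, C, K, K) with N output channels (assumed layout)
    alpha  -- per-channel scaling factors, shape (N,)
    R(w)   -- here an l2 regularizer, one of the typical choices mentioned above
    """
    b_w = torch.sign(w)                         # binarized weights b_w = sign(w)
    scaled = alpha.view(-1, 1, 1, 1) * b_w      # channel-wise product alpha o b_w
    recon = torch.sum((w - scaled) ** 2)        # ||w - alpha o b_w||_2^2
    reg = reg_weight * torch.sum(w ** 2)        # R(w)
    return recon + reg

w = torch.randn(64, 32, 3, 3)                   # example layer
alpha = w.abs().mean(dim=(1, 2, 3))             # one common closed-form choice for alpha
print(bilinear_objective(w, alpha))
```

The second form in Eq. 3.121 replaces the channel-wise product by the matrix-vector product $\mathbf{A}\mathbf{w}$ with $\mathbf{A} = \mathrm{diag}(1/\alpha_i)$, which makes the bilinear coupling of the two variables explicit.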
We introduce recurrent bilinear optimization for binary neural networks (RBONN) [259], which learns the coupled scaling factors and real-valued weights end-to-end. More specifically, the recurrent optimization can efficiently backtrack the weights, which are thus trained more sufficiently than with conventional methods. To this end, a Density-ReLU (DReLU) is introduced to activate the optimization process based on the density of the variable $\mathbf{A}$. In this way, we achieve a controlled learning process with a backtracking mechanism that accounts for the interaction between the variables, thus avoiding local minima and approaching the performance limit of BNNs, as shown in Fig. 3.26.
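The exact DReLU formulation and backtracking rule are given in [259]; the sketch below is only a hypothetical PyTorch illustration of the idea, in which a density measure on $\mathbf{A}$ (here, the fraction of entries above a threshold, an assumption) gates whether an extra backtracking gradient step is applied to $\mathbf{w}$.

```python
import torch

def density(x, eps=1e-3):
    # Fraction of entries whose magnitude exceeds eps: a simple density proxy (assumption).
    return (x.abs() > eps).float().mean().item()

def drelu_backtrack_step(w, A_diag, tau=0.5, lr=0.01):
    """Hypothetical DReLU-gated backtracking step (illustrative, not the exact RBONN update).

    When A looks dense (density > tau), w is assumed to lag behind A, so one extra
    gradient step on ||b_w - A w||_2^2 is taken on w; otherwise w is returned unchanged.
    """
    if density(A_diag) <= tau:
        return w                                      # gate closed: no backtracking
    w = w.detach().requires_grad_(True)
    b_w = torch.sign(w).detach()                      # treat b_w as a constant target
    residual = b_w - A_diag.view(-1, 1, 1, 1) * w     # A w with A = diag(1/alpha_i)
    loss = (residual ** 2).sum()
    loss.backward()
    with torch.no_grad():
        w_backtracked = w - lr * w.grad               # extra update on w only
    return w_backtracked.detach()
```

In a full training loop this step would be interleaved with the standard updates of $\mathbf{w}$ and $\mathbf{A}$; the threshold, the density measure, and the single-step update are placeholders for the recurrent module described in [259].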
However, such bilinear constraints lead to an asynchronous convergence problem that directly affects the learning of $\mathbf{A}$ and $\mathbf{w}$: the variable with the slower convergence speed (usually $\mathbf{w}$) is trained less sufficiently than the faster one. Moreover, BNN optimization is nonconvex and therefore suffers even more from local minima under such asynchronous convergence. A telling example is that $\mathbf{w}$ tends to fall into a local optimum with low magnitude when the magnitude of $\mathbf{A}$ is much larger than 0 (due to $\mathbf{b}_{\mathbf{w}} \in \{-1, +1\}$). On the contrary, $\mathbf{w}$ takes on a large magnitude and thus converges slowly when the elements of $\mathbf{A}$ are close to 0.
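This asymmetry can be seen in a one-dimensional toy example. The sketch below (step size and iteration count are assumptions for illustration) runs gradient descent on $(b_w - A w)^2$ for a scalar $w$ starting at $w = 1$: a large $A$ quickly pulls $w$ to the low-magnitude fixed point $1/A$, whereas a small $A$ yields a tiny gradient, so $w$ keeps a large magnitude and converges slowly.

```python
import torch

def converge_w(A_value, steps=200, lr=0.01):
    # Gradient descent on (b_w - A*w)^2 for a scalar w, with b_w = sign(w) at each step.
    w = torch.tensor(1.0, requires_grad=True)
    for _ in range(steps):
        b_w = torch.sign(w).detach()
        loss = (b_w - A_value * w) ** 2
        loss.backward()
        with torch.no_grad():
            w -= lr * w.grad
        w.grad.zero_()
    return w.item()

print(converge_w(A_value=5.0))    # large A: w collapses to the low-magnitude fixed point 1/A = 0.2
print(converge_w(A_value=0.05))   # small A: w stays large (~1.2), far from its fixed point 1/A = 20
```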